Symmetry as Intervention

An Analysis of Causal Effect Estimation using
Outcome Invariant Data Augmentation arXiv:2510.25128

Uzair Akbar

Georgia Tech

Niki Kilbertus

TU Munich

Hao Shen

TU Munich

Krikamol Muandet

CISPA

Bo Dai

Georgia Tech

November 10, 2025

Motivation

Correlation vs. causation

xkcd.com/925 by Randall Munroe.

  • Can we recover causal effects from observational data?
  • Yes—but only with untestable assumptions and domain knowledge!

This work




We try to answer the fundamental question:

Can knowledge of symmetries in data generation—often used implicitly in certain regularizers—be repurposed to
improve causal effect estimation given
only observational \((X, Y)\) data?

Statistical vs. Causal Estimation

Empirical risk minimization (ERM)

For treatment \(X\), outcome \(Y\) samples generated as

\[ Y = \rfunc{X} + \xi , \qquad \E{\xi} = 0 , \] statistical inference entails recovering the optimal predictor \(\E{ Y \mid X = \vx }\) by minimizing the risk \[ R_{\ERM}( \bh ) := \E{ \sqNorm{ Y - \bhyp{X} } } , \] over hypotheses \(\bh\in\H\) from a rich enough class \(\H\).

For uncorrelated \(X\) and noise \(\xi\), the minimizer \(\bh_{\ERM}(\vx)\)
coincides with the true causal effect \(\rfunc{\vx}\).
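As a quick numerical sanity check (our own illustrative sketch, not from the paper): with noise drawn independently of the treatment, the least-squares/ERM solution in a linear hypothesis class recovers the true coefficient. The variable names (`beta_erm` etc.) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50_000, 1.5

# Treatment X and independent zero-mean noise xi: no confounding.
x = rng.normal(size=n)
xi = rng.normal(size=n)
y = beta * x + xi

# ERM over linear hypotheses = ordinary least squares (closed form in 1-D).
beta_erm = (x @ y) / (x @ x)
```

With \(n = 50{,}000\) samples, `beta_erm` lands within a few hundredths of the true \(\beta = 1.5\).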

Data augmentation

For finite \(n\) samples \(\D := \{ (\vx_i, \vy_i) \}_{i=1}^{n}\), regularization
techniques are used to mitigate estimation variance.


E.g., data augmentation (DA) achieves this via multiple random augmentations \((\gG \vx_i, \vy_i)\) per sample in the risk


\[ R_{\DA+\ERM}( \bh ) := \E{ \sqNorm{ Y - \bhyp{ \gG X } } } , \qquad \gG\sim \P_{ \gG } . \]

Confounding bias and spurious correlation

Problem with ERM: Generally, \(X\) and \(\xi\) are correlated.

This makes the ERM minimizer a biased estimator of \(\rf\): \[ \begin{align*} Y &= \rfunc{X} + \xi ,\\ \nonumber \Rightarrow \underbrace{\E{Y \mid X = \vx} }_{\text{ERM minimizer}} &= \rfunc{\vx} + \underbrace{\E{ \xi \mid X = \vx}}_{\text{confounding bias $\neq 0$}} . \end{align*} \] This spurious correlation b/w \(X\) and \(\xi\) arises due to their unobserved common parents, called confounders.
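The bias above is easy to reproduce numerically. In this minimal sketch (our own illustration; the setup and names are hypothetical), an unobserved confounder \(u\) drives both \(X\) and \(\xi\), and the ERM/OLS estimate overshoots the true coefficient by exactly the confounding term.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50_000, 1.5

# Unobserved confounder u drives both the treatment and the noise.
u = rng.normal(size=n)
x = u + 0.5 * rng.normal(size=n)
xi = u                        # E[xi | X = x] != 0: confounding bias
y = beta * x + xi

beta_erm = (x @ y) / (x @ x)
bias = beta_erm - beta        # nonzero confounding bias of the ERM minimizer
```

Here the population bias is \(\operatorname{cov}(X,\xi)/\operatorname{var}(X) = 1/1.25 = 0.8\), so `beta_erm` is far from the causal \(\beta = 1.5\).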

Intervention for causal estimation

Removing the correlation b/w \(X\) and \(\xi\) requires an intervention—explicitly assigning \(X\) some independently sampled \(\Xtilde\) during data generation, a.k.a. a randomized controlled trial:

\[ Y = \rfunc{\Xtilde} + \xi \]

Now, doing ERM on samples of \((Y, \Xtilde)\) recovers \(\rf\).
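Continuing the toy confounded model above (our own sketch), replacing the confounded treatment with an independently sampled \(\Xtilde\) makes plain OLS consistent again, even though the noise itself is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50_000, 1.5

u = rng.normal(size=n)
xi = u                            # same confounded noise as before
x_tilde = rng.normal(size=n)      # intervention: X assigned independently of xi
y = beta * x_tilde + xi

# ERM on the intervened samples (Y, X_tilde) recovers the causal effect.
beta_rct = (x_tilde @ y) / (x_tilde @ x_tilde)
```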

Problem: Often not possible to intervene on real systems.
We only have access to pre-collected observational data.

Instrumental variables (IVs)

To work around the need for interventions, use an instrument \(\gZ\) satisfying

  1. treatment relevance \(\gZ \nindep X\),
  2. exclusion \(\gZ\indep Y \mid X\),
  3. un-confoundedness \(\gZ \indep \xi\),
  4. outcome relevance \(Y \nindep \gZ\).

Conditioning the model on \(\gZ\) gives \(\E{ Y \mid \gZ } = \E{ \rfunc{X} \mid \gZ }\), which can then be solved for \(\rf\) by minimizing the risk \[ R_{\IV}( \bh ) := \E{ \sqNorm{ Y - \E{ \bhyp{X} \mid \gZ } } } . \]
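In the linear one-dimensional case, the minimizer of \(R_{\IV}\) has the familiar closed form \(\operatorname{cov}(\gZ, Y)/\operatorname{cov}(\gZ, X)\). A minimal sketch (our own illustration, assuming a linear model with a valid instrument):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 1.5

z = rng.normal(size=n)    # instrument: relevant, excluded, unconfounded
u = rng.normal(size=n)    # unobserved confounder
x = z + u                 # treatment depends on both z and u
xi = u                    # noise correlated with x, but not with z
y = beta * x + xi

# Closed-form linear IV estimator: cov(z, y) / cov(z, x).
beta_iv = (z @ y) / (z @ x)
```

Since \(\gZ \indep \xi\), the numerator picks up only \(\beta\operatorname{cov}(\gZ, X)\), and `beta_iv` is consistent despite the confounding.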

Problem: Instruments are scarce in most application domains.

Causal Estimation with Data Augmentation

Data augmentation = model symmetries

We restrict ourselves to DA transformations with respect to which \(\rf\) is invariant. Specifically, \(\gG\) takes values in \(\G\) such that \(\rf\) is \(\G\)-invariant: \[ \rfunc{\vx} = \rfunc{\vg \vx}, \qquad \forall \;\; (\vx, \vg)\in \X\times\G . \] Of course, constructing such DA requires knowledge of symmetries of \(\rf\). E.g., when classifying images \(\vx\) of cats vs. dogs, the true labeling function would certainly be invariant to random image rotations \(\gG\vx\).
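The invariance condition is easy to check numerically for a concrete symmetry group. In this small sketch (our own example), a radial function is invariant under the rotation group \(\G = SO(2)\), exactly the property required of \(\rf\):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A rotation-invariant labeling function: depends only on the radius ||x||.
    return np.linalg.norm(x)

# A random rotation g in G = SO(2).
theta = rng.uniform(0, 2 * np.pi)
g = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = rng.normal(size=2)
# G-invariance: f(g x) == f(x) for all x and all g in G.
```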

Data augmentation = soft intervention

Key insight: When \(\rf\) is \(\G\)-invariant, \((Y, \gG X)\) follows the data generation: \[ Y = \rfunc{ \gG X } + \xi . \] Therefore, DA is equivalent to a soft intervention on the treatment \(X\).

\(\Rightarrow\) DA+ERM dominates vanilla ERM on causal estimation error (CER): \[ \CER(\bh) := \E{ \sqNorm{ \rfunc{X} - \bhyp{X} } } , \qquad \boxed{ \CER(\bh_{\DA+\ERM}) \leq \CER(\bh_{\ERM}) } . \]

  • Strictly better when DA perturbs spurious features correlated with \(\xi\).
  • But otherwise performs no worse than ERM.
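The dominance result can be sketched in a two-feature linear model (our own illustration, with hypothetical helper names `ols` and `cer`): the second feature is spurious (it lies in \(\operatorname{null}(\rbf)\) but is correlated with \(\xi\)), and DA along that direction shrinks the spurious coefficient, reducing the CER.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 20_000, 2.0
beta = np.array([1.0, 0.0])          # true effect: second feature is spurious

u = rng.normal(size=n)               # unobserved confounder
X = np.column_stack([rng.normal(size=n),
                     u + 0.1 * rng.normal(size=n)])   # spurious feature ~ u
xi = u
y = X @ beta + xi

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

h_erm = ols(X, y)                    # picks up the spurious feature

# DA along null(beta): perturb only the spurious second coordinate.
K = 8                                # augmentations per sample
Xa = np.tile(X, (K, 1))
Xa[:, 1] += gamma * rng.normal(size=K * n)
ya = np.tile(y, K)
h_da = ols(Xa, ya)

def cer(h):
    # Causal estimation error E[(f(X) - h(X))^2] on the observed X sample.
    return np.mean((X @ beta - X @ h) ** 2)
```

Here the augmentation acts like a ridge penalty on the spurious coordinate only, so `cer(h_da)` drops well below `cer(h_erm)`, consistent with the boxed inequality.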

Data augmentation = relaxed IVs

Key insight: DA params \(\gG\) are IV-like (IVL)—having IV properties 1–3 by design.


Such a relaxation renders IV regression ill-posed, so we suggest IVL regression:

\[ R_{\IVL}(\bh) := R_{\IV}(\bh) + \underbrace{\boxed{\alpha \cdot R_{\ERM}(\bh)}}_{\text{ERM regularizer for ill-posed IV reg.}} \] \(\Rightarrow\) The composition DA+IVL simulates a worst-case/adversarial DA.

\(\Rightarrow\) DA+IVL dominates DA+ERM; strictly better iff spurious features are perturbed. \[ \boxed{\CER(\bh_{\DA+\IVL}) \leq \CER(\bh_{\DA+\ERM})} \]

Data augmentation = causal regularization

Causal regularization: Methods that improve causal estimation even when full identification of \(\rf\) is not possible.

Why bother?

  • No-regret improvement: Under our symmetry-based DA construction, DA dominates on causal estimation error: sometimes better, never worse.
  • Robust prediction: Reducing confounding bias in estimation reduces sensitivity to spurious features, allowing for more robust predictors under distribution shifts.

Experiments

Simulation Ablations

Simulation experiment with a linear, centered Gaussian model with non-zero \(\rbf\in\R^m\), confounding strength \(\kappa > 0\), and DA strength \(\gamma > 0\), s.t. \[ \gG X := X + \gamma\cdot \gG , \qquad \gG\in\operatorname{null}(\rbf) . \] Normalized CER (nCER) \(=0\) for the true \(\rf\) and \(=1\) for pure confounding.

Baseline Comparison

Comparison with select causal regularization methods and common domain generalisation baselines. All methods are provided only \((X, Y)\) data along with DA transformations \(\gG\)—Gaussian noise in the optical device dataset, and hue, saturation, contrast, and translation perturbations in colored-MNIST.

Conclusion

References